Intro to RNA-seq Analysis
Hosted by the IIGB Bioinformatics Core
About the Core
IIGB Bioinformatics Core
- Data analysis
- RNA-seq, ChIP-seq
- scRNA-seq
- DNA-seq
- Other
- Project consultations
- Method development
- Daily (2-3PM)
- In-person or via Zoom (by appt)
- Calendly
Workshop Outline
- Core intro
- Server Access
- Introduction to RNA-seq
- RNA-seq workflow (in depth)
- Hands-on with Jupyter Notebook
Accessing the Web-based Notebook
Logging in to the server via OnDemand
You should receive an email with login information if you’ve never had an account on the HPCC. For those with an account already, you can use the same credentials to log into the server.
Open up your browser and type the following URL: https://ondemand.hpcc.ucr.edu
Use the provided login credentials to log in to OnDemand. Then you’ll be ask for authentication password and two-factor validation with DUO.
After logging in, you’ll see the HPCC OnDemand Welcome Page
Requesting a Juypter Notebook Interactive Session
We will use the interactive Jupyter notebook for the workshop tutorials. To request the interactive app, click on Interactive Apps and select Jupyter Notebook.
You will see an app launch form. Fill in the following information:
Then click Launch and wait for resources to be gathered for your app.
Once the app is ready, click Connect to Jupyter to launch the Jupyter notebook.
Setting up the analysis directory
We will clone the GitHub repository for this workshop. The repo is stored at: GitHub Repo
Cloning the repository will download all the files and directories from GitHub to your server/computer.
Cloning the repository:
Once the cloning is complete, we can go into the newly-downloaded workshop repository to run our analyses.
# change to the workshop directory
cd RNA-Seq-Workshop-Jupyter-notebook
# lists directory content
ls
The home directory structure for the workshop should look like this:
├── README.md
├── RNA-seq-workflow.ipynb
├── analysis
├── code
├── genome
├── index
├── log
├── metadata
└── rawCloning the repository will give you the most up-to-date files. To get the latest updates of the repository later on, you don’t need to clone again. Instead, you can pull the updates from GitHub to your server/computer. Make sure you are in the home directory of the workshop before running the command.
The pull command will overwrite any changes you’ve made locally. If you want to keep the changes and update the new things, you can run:
git stash: This command stashes your local changes, creating a temporary backup.git pull: This command fetches the changes from the remote repository and merges them into the local branch.git stash apply: This command applies the stashed changes back to the working directory.
Alternatively, if you want a complete reset of your local directory with the updates from GitHub, you can run:
git reset --hard HEAD: This command resets the current branch to the HEAD commit of the remote repository, discarding any local changes.git clean -f -d: This command removes any untracked files and directories from the working directory.git pull: This command fetches the changes from the remote repository and merges them into the local branch.
RNA-seq Background
What is RNA-seq?
RNA sequencing or RNA-seq is one of many methods used for gene expression studies by obtaining a snapshot of the RNA molecules within a biological system.
Reference: Van den Berge et. al (2019) Annu Rev Biomed Data Sci
What is RNA-seq?
Next-generation sequencing (NGS) technologies
- Short-read based (e.g. Illumina)
- 35 - 300 bases
- Long-read based (e.g., PacBio, Oxford Nanopore)
- Several kilobases
Sequencing resolution
- Bulk tissue (bulk RNA-seq)
- Laser-capture microdissection (LCM + RNA-seq)
- Single-cell (scRNA-seq)
- Spatial transcriptomics (spatial + scRNA-seq)
RNA-seq analysis workflow
Sample Dataset Demo
| Category | Description |
|---|---|
| BioProject | PRJNA936435 |
| GEO Series | GSE225576 |
| Title | An atlas of the aging mouse transcriptome and proteome reveals the features of age-related post-transcriptional dysregulation (PMID: 39353907) |
| Organism | Mus musculus |
| Strain | C57BL/6J |
| Genotype | Wild type |
| Replicates | 4 |
| Organ | aorta, brain, heart, kidney, liver, lung, muscle, skin |
| Age (months) | 6, 15, 24, 30 |
| Library kit | Illumina Stranded mRNA Prep kit |
| Sequencing | Illumina paired-end 150bp (NovaSeq 6000) |
Workflow Outline
- Create metadata
- QC using
Fastqcandtrim_galore - Genome indexing for
STARaligner - Sequence alignment with
STAR - Gene quantification using
featureCounts - Differential expression analysis with
DESeq2 - Data visualization in IGV
Analysis Toolkit
| Program | Type | Software | Reference | Misc |
|---|---|---|---|---|
| Fastqc | QC | Software | Ref | |
| Trim_galore | QC | Software | Ref | |
| Multiqc | QC | Software | Ref | Manual |
| STAR | Alignment | Software | Ref | Manual |
| FeatureCounts | Quantification | Software | Ref | |
| DESeq2 | Differential Expression | Software | Ref | Guide |
| IGV | Visualization | Software | Ref | Web App |
| R | Other | Software | Ref | |
| Tidyverse | Other | Software | Ref | |
| EnhancedVolcano | Other | Software | Ref |
Data Preparation - Create metadata
Generate a metadata.csv file containing information about the dataset. The metadata will be used for processing and differential expression analysis.
::: {.column width = “30%”} Metadata Header
::: {.column width = “70%”} Description
SRR ID from the SRA run
C57BL/6J
Wild type
male
brain, heart, kidney, liver, lung, muscle, skin
6, 15, 24, 30 (months)
1,2,3,4
brain_6mo, heart_6mo, ...
brain_6mo_br1, brain_6mo_br2, ...
SRR23517590_1.fastq.gz, SRR23517591_1.fastq.gz, ...
SRR23517590_2.fastq.gz, SRR23517591_2.fastq.gz, ...An extensive metadata file describes the samples in detail. However, the minimum metadata file should contain the following:
- samplename (sometime it’s the same name as the fastq file),
- fq1/fq2
- condition/treatment/genotype (factor to run comparison)
Other metadata info can include:
- time point
- technical replicate
- sequencing batch
- library kits
- cell line
- etc…
This file contains the metadata for the full dataset (all 112 samples)
samplename,fq1,fq2,srr_id,sex,strain,organ,age,biorep,factor
Skin_30mo_br1,SRR23517578_1.fastq.gz,SRR23517578_2.fastq.gz,SRR23517578,male,C57BL/6J,Skin,30mo,1,Skin_30mo
Skin_30mo_br2,SRR23517579_1.fastq.gz,SRR23517579_2.fastq.gz,SRR23517579,male,C57BL/6J,Skin,30mo,2,Skin_30mo
Skin_30mo_br3,SRR23517580_1.fastq.gz,SRR23517580_2.fastq.gz,SRR23517580,male,C57BL/6J,Skin,30mo,3,Skin_30mo
Skin_30mo_br4,SRR23517581_1.fastq.gz,SRR23517581_2.fastq.gz,SRR23517581,male,C57BL/6J,Skin,30mo,4,Skin_30mo
Skin_24mo_br1,SRR23517582_1.fastq.gz,SRR23517582_2.fastq.gz,SRR23517582,male,C57BL/6J,Skin,24mo,1,Skin_24mo
Skin_24mo_br2,SRR23517583_1.fastq.gz,SRR23517583_2.fastq.gz,SRR23517583,male,C57BL/6J,Skin,24mo,2,Skin_24mo
Skin_24mo_br3,SRR23517584_1.fastq.gz,SRR23517584_2.fastq.gz,SRR23517584,male,C57BL/6J,Skin,24mo,3,Skin_24mo
Skin_24mo_br4,SRR23517585_1.fastq.gz,SRR23517585_2.fastq.gz,SRR23517585,male,C57BL/6J,Skin,24mo,4,Skin_24mo
Skin_15mo_br1,SRR23517586_1.fastq.gz,SRR23517586_2.fastq.gz,SRR23517586,male,C57BL/6J,Skin,15mo,1,Skin_15mo
Skin_15mo_br2,SRR23517587_1.fastq.gz,SRR23517587_2.fastq.gz,SRR23517587,male,C57BL/6J,Skin,15mo,2,Skin_15mo
Skin_15mo_br3,SRR23517588_1.fastq.gz,SRR23517588_2.fastq.gz,SRR23517588,male,C57BL/6J,Skin,15mo,3,Skin_15mo
Skin_15mo_br4,SRR23517589_1.fastq.gz,SRR23517589_2.fastq.gz,SRR23517589,male,C57BL/6J,Skin,15mo,4,Skin_15mo
Skin_6mo_br1,SRR23517590_1.fastq.gz,SRR23517590_2.fastq.gz,SRR23517590,male,C57BL/6J,Skin,6mo,1,Skin_6mo
Skin_6mo_br2,SRR23517591_1.fastq.gz,SRR23517591_2.fastq.gz,SRR23517591,male,C57BL/6J,Skin,6mo,2,Skin_6mo
Skin_6mo_br3,SRR23517592_1.fastq.gz,SRR23517592_2.fastq.gz,SRR23517592,male,C57BL/6J,Skin,6mo,3,Skin_6mo
Skin_6mo_br4,SRR23517593_1.fastq.gz,SRR23517593_2.fastq.gz,SRR23517593,male,C57BL/6J,Skin,6mo,4,Skin_6mo
Kidney_30mo_br1,SRR23517594_1.fastq.gz,SRR23517594_2.fastq.gz,SRR23517594,male,C57BL/6J,Kidney,30mo,1,Kidney_30mo
Kidney_30mo_br2,SRR23517595_1.fastq.gz,SRR23517595_2.fastq.gz,SRR23517595,male,C57BL/6J,Kidney,30mo,2,Kidney_30mo
Kidney_30mo_br3,SRR23517596_1.fastq.gz,SRR23517596_2.fastq.gz,SRR23517596,male,C57BL/6J,Kidney,30mo,3,Kidney_30mo
Kidney_30mo_br4,SRR23517597_1.fastq.gz,SRR23517597_2.fastq.gz,SRR23517597,male,C57BL/6J,Kidney,30mo,4,Kidney_30mo
Kidney_24mo_br1,SRR23517598_1.fastq.gz,SRR23517598_2.fastq.gz,SRR23517598,male,C57BL/6J,Kidney,24mo,1,Kidney_24mo
Kidney_24mo_br2,SRR23517599_1.fastq.gz,SRR23517599_2.fastq.gz,SRR23517599,male,C57BL/6J,Kidney,24mo,2,Kidney_24mo
Kidney_24mo_br3,SRR23517600_1.fastq.gz,SRR23517600_2.fastq.gz,SRR23517600,male,C57BL/6J,Kidney,24mo,3,Kidney_24mo
Kidney_24mo_br4,SRR23517601_1.fastq.gz,SRR23517601_2.fastq.gz,SRR23517601,male,C57BL/6J,Kidney,24mo,4,Kidney_24mo
Kidney_15mo_br1,SRR23517602_1.fastq.gz,SRR23517602_2.fastq.gz,SRR23517602,male,C57BL/6J,Kidney,15mo,1,Kidney_15mo
Kidney_15mo_br2,SRR23517603_1.fastq.gz,SRR23517603_2.fastq.gz,SRR23517603,male,C57BL/6J,Kidney,15mo,2,Kidney_15mo
Kidney_15mo_br3,SRR23517604_1.fastq.gz,SRR23517604_2.fastq.gz,SRR23517604,male,C57BL/6J,Kidney,15mo,3,Kidney_15mo
Kidney_15mo_br4,SRR23517605_1.fastq.gz,SRR23517605_2.fastq.gz,SRR23517605,male,C57BL/6J,Kidney,15mo,4,Kidney_15mo
Kidney_6mo_br1,SRR23517606_1.fastq.gz,SRR23517606_2.fastq.gz,SRR23517606,male,C57BL/6J,Kidney,6mo,1,Kidney_6mo
Kidney_6mo_br2,SRR23517607_1.fastq.gz,SRR23517607_2.fastq.gz,SRR23517607,male,C57BL/6J,Kidney,6mo,2,Kidney_6mo
Kidney_6mo_br3,SRR23517608_1.fastq.gz,SRR23517608_2.fastq.gz,SRR23517608,male,C57BL/6J,Kidney,6mo,3,Kidney_6mo
Kidney_6mo_br4,SRR23517609_1.fastq.gz,SRR23517609_2.fastq.gz,SRR23517609,male,C57BL/6J,Kidney,6mo,4,Kidney_6mo
Heart_30mo_br1,SRR23517610_1.fastq.gz,SRR23517610_2.fastq.gz,SRR23517610,male,C57BL/6J,Heart,30mo,1,Heart_30mo
Heart_30mo_br2,SRR23517611_1.fastq.gz,SRR23517611_2.fastq.gz,SRR23517611,male,C57BL/6J,Heart,30mo,2,Heart_30mo
Heart_30mo_br3,SRR23517612_1.fastq.gz,SRR23517612_2.fastq.gz,SRR23517612,male,C57BL/6J,Heart,30mo,3,Heart_30mo
Heart_30mo_br4,SRR23517613_1.fastq.gz,SRR23517613_2.fastq.gz,SRR23517613,male,C57BL/6J,Heart,30mo,4,Heart_30mo
Heart_24mo_br1,SRR23517614_1.fastq.gz,SRR23517614_2.fastq.gz,SRR23517614,male,C57BL/6J,Heart,24mo,1,Heart_24mo
Heart_24mo_br2,SRR23517615_1.fastq.gz,SRR23517615_2.fastq.gz,SRR23517615,male,C57BL/6J,Heart,24mo,2,Heart_24mo
Heart_24mo_br3,SRR23517616_1.fastq.gz,SRR23517616_2.fastq.gz,SRR23517616,male,C57BL/6J,Heart,24mo,3,Heart_24mo
Heart_24mo_br4,SRR23517617_1.fastq.gz,SRR23517617_2.fastq.gz,SRR23517617,male,C57BL/6J,Heart,24mo,4,Heart_24mo
Heart_15mo_br1,SRR23517618_1.fastq.gz,SRR23517618_2.fastq.gz,SRR23517618,male,C57BL/6J,Heart,15mo,1,Heart_15mo
Heart_15mo_br2,SRR23517619_1.fastq.gz,SRR23517619_2.fastq.gz,SRR23517619,male,C57BL/6J,Heart,15mo,2,Heart_15mo
Heart_15mo_br3,SRR23517620_1.fastq.gz,SRR23517620_2.fastq.gz,SRR23517620,male,C57BL/6J,Heart,15mo,3,Heart_15mo
Heart_15mo_br4,SRR23517621_1.fastq.gz,SRR23517621_2.fastq.gz,SRR23517621,male,C57BL/6J,Heart,15mo,4,Heart_15mo
Heart_6mo_br1,SRR23517622_1.fastq.gz,SRR23517622_2.fastq.gz,SRR23517622,male,C57BL/6J,Heart,6mo,1,Heart_6mo
Heart_6mo_br2,SRR23517623_1.fastq.gz,SRR23517623_2.fastq.gz,SRR23517623,male,C57BL/6J,Heart,6mo,2,Heart_6mo
Heart_6mo_br3,SRR23517624_1.fastq.gz,SRR23517624_2.fastq.gz,SRR23517624,male,C57BL/6J,Heart,6mo,3,Heart_6mo
Heart_6mo_br4,SRR23517625_1.fastq.gz,SRR23517625_2.fastq.gz,SRR23517625,male,C57BL/6J,Heart,6mo,4,Heart_6mo
Brain_30mo_br1,SRR23517626_1.fastq.gz,SRR23517626_2.fastq.gz,SRR23517626,male,C57BL/6J,Brain,30mo,1,Brain_30mo
Brain_30mo_br2,SRR23517627_1.fastq.gz,SRR23517627_2.fastq.gz,SRR23517627,male,C57BL/6J,Brain,30mo,2,Brain_30mo
Brain_30mo_br3,SRR23517628_1.fastq.gz,SRR23517628_2.fastq.gz,SRR23517628,male,C57BL/6J,Brain,30mo,3,Brain_30mo
Brain_30mo_br4,SRR23517629_1.fastq.gz,SRR23517629_2.fastq.gz,SRR23517629,male,C57BL/6J,Brain,30mo,4,Brain_30mo
Brain_24mo_br1,SRR23517630_1.fastq.gz,SRR23517630_2.fastq.gz,SRR23517630,male,C57BL/6J,Brain,24mo,1,Brain_24mo
Brain_24mo_br2,SRR23517631_1.fastq.gz,SRR23517631_2.fastq.gz,SRR23517631,male,C57BL/6J,Brain,24mo,2,Brain_24mo
Brain_24mo_br3,SRR23517632_1.fastq.gz,SRR23517632_2.fastq.gz,SRR23517632,male,C57BL/6J,Brain,24mo,3,Brain_24mo
Brain_24mo_br4,SRR23517633_1.fastq.gz,SRR23517633_2.fastq.gz,SRR23517633,male,C57BL/6J,Brain,24mo,4,Brain_24mo
Brain_15mo_br1,SRR23517634_1.fastq.gz,SRR23517634_2.fastq.gz,SRR23517634,male,C57BL/6J,Brain,15mo,1,Brain_15mo
Brain_15mo_br2,SRR23517635_1.fastq.gz,SRR23517635_2.fastq.gz,SRR23517635,male,C57BL/6J,Brain,15mo,2,Brain_15mo
Brain_15mo_br3,SRR23517636_1.fastq.gz,SRR23517636_2.fastq.gz,SRR23517636,male,C57BL/6J,Brain,15mo,3,Brain_15mo
Brain_15mo_br4,SRR23517637_1.fastq.gz,SRR23517637_2.fastq.gz,SRR23517637,male,C57BL/6J,Brain,15mo,4,Brain_15mo
Brain_6mo_br1,SRR23517638_1.fastq.gz,SRR23517638_2.fastq.gz,SRR23517638,male,C57BL/6J,Brain,6mo,1,Brain_6mo
Brain_6mo_br2,SRR23517639_1.fastq.gz,SRR23517639_2.fastq.gz,SRR23517639,male,C57BL/6J,Brain,6mo,2,Brain_6mo
Brain_6mo_br3,SRR23517640_1.fastq.gz,SRR23517640_2.fastq.gz,SRR23517640,male,C57BL/6J,Brain,6mo,3,Brain_6mo
Brain_6mo_br4,SRR23517641_1.fastq.gz,SRR23517641_2.fastq.gz,SRR23517641,male,C57BL/6J,Brain,6mo,4,Brain_6mo
Muscle_30mo_br1,SRR23517642_1.fastq.gz,SRR23517642_2.fastq.gz,SRR23517642,male,C57BL/6J,Muscle,30mo,1,Muscle_30mo
Muscle_30mo_br2,SRR23517643_1.fastq.gz,SRR23517643_2.fastq.gz,SRR23517643,male,C57BL/6J,Muscle,30mo,2,Muscle_30mo
Muscle_30mo_br3,SRR23517644_1.fastq.gz,SRR23517644_2.fastq.gz,SRR23517644,male,C57BL/6J,Muscle,30mo,3,Muscle_30mo
Muscle_30mo_br4,SRR23517645_1.fastq.gz,SRR23517645_2.fastq.gz,SRR23517645,male,C57BL/6J,Muscle,30mo,4,Muscle_30mo
Muscle_24mo_br1,SRR23517646_1.fastq.gz,SRR23517646_2.fastq.gz,SRR23517646,male,C57BL/6J,Muscle,24mo,1,Muscle_24mo
Muscle_24mo_br2,SRR23517647_1.fastq.gz,SRR23517647_2.fastq.gz,SRR23517647,male,C57BL/6J,Muscle,24mo,2,Muscle_24mo
Muscle_24mo_br3,SRR23517648_1.fastq.gz,SRR23517648_2.fastq.gz,SRR23517648,male,C57BL/6J,Muscle,24mo,3,Muscle_24mo
Muscle_24mo_br4,SRR23517649_1.fastq.gz,SRR23517649_2.fastq.gz,SRR23517649,male,C57BL/6J,Muscle,24mo,4,Muscle_24mo
Muscle_15mo_br1,SRR23517650_1.fastq.gz,SRR23517650_2.fastq.gz,SRR23517650,male,C57BL/6J,Muscle,15mo,1,Muscle_15mo
Muscle_15mo_br2,SRR23517651_1.fastq.gz,SRR23517651_2.fastq.gz,SRR23517651,male,C57BL/6J,Muscle,15mo,2,Muscle_15mo
Muscle_15mo_br3,SRR23517652_1.fastq.gz,SRR23517652_2.fastq.gz,SRR23517652,male,C57BL/6J,Muscle,15mo,3,Muscle_15mo
Muscle_15mo_br4,SRR23517653_1.fastq.gz,SRR23517653_2.fastq.gz,SRR23517653,male,C57BL/6J,Muscle,15mo,4,Muscle_15mo
Muscle_6mo_br1,SRR23517654_1.fastq.gz,SRR23517654_2.fastq.gz,SRR23517654,male,C57BL/6J,Muscle,6mo,1,Muscle_6mo
Muscle_6mo_br2,SRR23517655_1.fastq.gz,SRR23517655_2.fastq.gz,SRR23517655,male,C57BL/6J,Muscle,6mo,2,Muscle_6mo
Muscle_6mo_br3,SRR23517656_1.fastq.gz,SRR23517656_2.fastq.gz,SRR23517656,male,C57BL/6J,Muscle,6mo,3,Muscle_6mo
Muscle_6mo_br4,SRR23517657_1.fastq.gz,SRR23517657_2.fastq.gz,SRR23517657,male,C57BL/6J,Muscle,6mo,4,Muscle_6mo
Lung_30mo_br1,SRR23517658_1.fastq.gz,SRR23517658_2.fastq.gz,SRR23517658,male,C57BL/6J,Lung,30mo,1,Lung_30mo
Lung_30mo_br2,SRR23517659_1.fastq.gz,SRR23517659_2.fastq.gz,SRR23517659,male,C57BL/6J,Lung,30mo,2,Lung_30mo
Lung_30mo_br3,SRR23517660_1.fastq.gz,SRR23517660_2.fastq.gz,SRR23517660,male,C57BL/6J,Lung,30mo,3,Lung_30mo
Lung_30mo_br4,SRR23517661_1.fastq.gz,SRR23517661_2.fastq.gz,SRR23517661,male,C57BL/6J,Lung,30mo,4,Lung_30mo
Lung_24mo_br1,SRR23517662_1.fastq.gz,SRR23517662_2.fastq.gz,SRR23517662,male,C57BL/6J,Lung,24mo,1,Lung_24mo
Lung_24mo_br2,SRR23517663_1.fastq.gz,SRR23517663_2.fastq.gz,SRR23517663,male,C57BL/6J,Lung,24mo,2,Lung_24mo
Lung_24mo_br3,SRR23517664_1.fastq.gz,SRR23517664_2.fastq.gz,SRR23517664,male,C57BL/6J,Lung,24mo,3,Lung_24mo
Lung_24mo_br4,SRR23517665_1.fastq.gz,SRR23517665_2.fastq.gz,SRR23517665,male,C57BL/6J,Lung,24mo,4,Lung_24mo
Lung_15mo_br1,SRR23517666_1.fastq.gz,SRR23517666_2.fastq.gz,SRR23517666,male,C57BL/6J,Lung,15mo,1,Lung_15mo
Lung_15mo_br2,SRR23517667_1.fastq.gz,SRR23517667_2.fastq.gz,SRR23517667,male,C57BL/6J,Lung,15mo,2,Lung_15mo
Lung_15mo_br3,SRR23517668_1.fastq.gz,SRR23517668_2.fastq.gz,SRR23517668,male,C57BL/6J,Lung,15mo,3,Lung_15mo
Lung_15mo_br4,SRR23517669_1.fastq.gz,SRR23517669_2.fastq.gz,SRR23517669,male,C57BL/6J,Lung,15mo,4,Lung_15mo
Lung_6mo_br1,SRR23517670_1.fastq.gz,SRR23517670_2.fastq.gz,SRR23517670,male,C57BL/6J,Lung,6mo,1,Lung_6mo
Lung_6mo_br2,SRR23517671_1.fastq.gz,SRR23517671_2.fastq.gz,SRR23517671,male,C57BL/6J,Lung,6mo,2,Lung_6mo
Lung_6mo_br3,SRR23517672_1.fastq.gz,SRR23517672_2.fastq.gz,SRR23517672,male,C57BL/6J,Lung,6mo,3,Lung_6mo
Lung_6mo_br4,SRR23517673_1.fastq.gz,SRR23517673_2.fastq.gz,SRR23517673,male,C57BL/6J,Lung,6mo,4,Lung_6mo
Liver_30mo_br1,SRR23517674_1.fastq.gz,SRR23517674_2.fastq.gz,SRR23517674,male,C57BL/6J,Liver,30mo,1,Liver_30mo
Liver_30mo_br2,SRR23517675_1.fastq.gz,SRR23517675_2.fastq.gz,SRR23517675,male,C57BL/6J,Liver,30mo,2,Liver_30mo
Liver_30mo_br3,SRR23517676_1.fastq.gz,SRR23517676_2.fastq.gz,SRR23517676,male,C57BL/6J,Liver,30mo,3,Liver_30mo
Liver_30mo_br4,SRR23517677_1.fastq.gz,SRR23517677_2.fastq.gz,SRR23517677,male,C57BL/6J,Liver,30mo,4,Liver_30mo
Liver_24mo_br1,SRR23517678_1.fastq.gz,SRR23517678_2.fastq.gz,SRR23517678,male,C57BL/6J,Liver,24mo,1,Liver_24mo
Liver_24mo_br2,SRR23517679_1.fastq.gz,SRR23517679_2.fastq.gz,SRR23517679,male,C57BL/6J,Liver,24mo,2,Liver_24mo
Liver_24mo_br3,SRR23517680_1.fastq.gz,SRR23517680_2.fastq.gz,SRR23517680,male,C57BL/6J,Liver,24mo,3,Liver_24mo
Liver_24mo_br4,SRR23517681_1.fastq.gz,SRR23517681_2.fastq.gz,SRR23517681,male,C57BL/6J,Liver,24mo,4,Liver_24mo
Liver_15mo_br1,SRR23517682_1.fastq.gz,SRR23517682_2.fastq.gz,SRR23517682,male,C57BL/6J,Liver,15mo,1,Liver_15mo
Liver_15mo_br2,SRR23517683_1.fastq.gz,SRR23517683_2.fastq.gz,SRR23517683,male,C57BL/6J,Liver,15mo,2,Liver_15mo
Liver_15mo_br3,SRR23517684_1.fastq.gz,SRR23517684_2.fastq.gz,SRR23517684,male,C57BL/6J,Liver,15mo,3,Liver_15mo
Liver_15mo_br4,SRR23517685_1.fastq.gz,SRR23517685_2.fastq.gz,SRR23517685,male,C57BL/6J,Liver,15mo,4,Liver_15mo
Liver_6mo_br1,SRR23517686_1.fastq.gz,SRR23517686_2.fastq.gz,SRR23517686,male,C57BL/6J,Liver,6mo,1,Liver_6mo
Liver_6mo_br2,SRR23517687_1.fastq.gz,SRR23517687_2.fastq.gz,SRR23517687,male,C57BL/6J,Liver,6mo,2,Liver_6mo
Liver_6mo_br3,SRR23517688_1.fastq.gz,SRR23517688_2.fastq.gz,SRR23517688,male,C57BL/6J,Liver,6mo,3,Liver_6mo
Liver_6mo_br4,SRR23517689_1.fastq.gz,SRR23517689_2.fastq.gz,SRR23517689,male,C57BL/6J,Liver,6mo,4,Liver_6moThis file contains the metadata referencing the samples from 6 month old mice.
samplename,fq1,fq2,srr_id,sex,strain,organ,age,biorep,factor
Skin_6mo_br1,SRR23517590_1.fastq.gz,SRR23517590_2.fastq.gz,SRR23517590,male,C57BL/6J,Skin,6mo,1,Skin_6mo
Skin_6mo_br2,SRR23517591_1.fastq.gz,SRR23517591_2.fastq.gz,SRR23517591,male,C57BL/6J,Skin,6mo,2,Skin_6mo
Skin_6mo_br3,SRR23517592_1.fastq.gz,SRR23517592_2.fastq.gz,SRR23517592,male,C57BL/6J,Skin,6mo,3,Skin_6mo
Skin_6mo_br4,SRR23517593_1.fastq.gz,SRR23517593_2.fastq.gz,SRR23517593,male,C57BL/6J,Skin,6mo,4,Skin_6mo
Kidney_6mo_br1,SRR23517606_1.fastq.gz,SRR23517606_2.fastq.gz,SRR23517606,male,C57BL/6J,Kidney,6mo,1,Kidney_6mo
Kidney_6mo_br2,SRR23517607_1.fastq.gz,SRR23517607_2.fastq.gz,SRR23517607,male,C57BL/6J,Kidney,6mo,2,Kidney_6mo
Kidney_6mo_br3,SRR23517608_1.fastq.gz,SRR23517608_2.fastq.gz,SRR23517608,male,C57BL/6J,Kidney,6mo,3,Kidney_6mo
Kidney_6mo_br4,SRR23517609_1.fastq.gz,SRR23517609_2.fastq.gz,SRR23517609,male,C57BL/6J,Kidney,6mo,4,Kidney_6mo
Heart_6mo_br1,SRR23517622_1.fastq.gz,SRR23517622_2.fastq.gz,SRR23517622,male,C57BL/6J,Heart,6mo,1,Heart_6mo
Heart_6mo_br2,SRR23517623_1.fastq.gz,SRR23517623_2.fastq.gz,SRR23517623,male,C57BL/6J,Heart,6mo,2,Heart_6mo
Heart_6mo_br3,SRR23517624_1.fastq.gz,SRR23517624_2.fastq.gz,SRR23517624,male,C57BL/6J,Heart,6mo,3,Heart_6mo
Heart_6mo_br4,SRR23517625_1.fastq.gz,SRR23517625_2.fastq.gz,SRR23517625,male,C57BL/6J,Heart,6mo,4,Heart_6mo
Brain_6mo_br1,SRR23517638_1.fastq.gz,SRR23517638_2.fastq.gz,SRR23517638,male,C57BL/6J,Brain,6mo,1,Brain_6mo
Brain_6mo_br2,SRR23517639_1.fastq.gz,SRR23517639_2.fastq.gz,SRR23517639,male,C57BL/6J,Brain,6mo,2,Brain_6mo
Brain_6mo_br3,SRR23517640_1.fastq.gz,SRR23517640_2.fastq.gz,SRR23517640,male,C57BL/6J,Brain,6mo,3,Brain_6mo
Brain_6mo_br4,SRR23517641_1.fastq.gz,SRR23517641_2.fastq.gz,SRR23517641,male,C57BL/6J,Brain,6mo,4,Brain_6mo
Muscle_6mo_br1,SRR23517654_1.fastq.gz,SRR23517654_2.fastq.gz,SRR23517654,male,C57BL/6J,Muscle,6mo,1,Muscle_6mo
Muscle_6mo_br2,SRR23517655_1.fastq.gz,SRR23517655_2.fastq.gz,SRR23517655,male,C57BL/6J,Muscle,6mo,2,Muscle_6mo
Muscle_6mo_br3,SRR23517656_1.fastq.gz,SRR23517656_2.fastq.gz,SRR23517656,male,C57BL/6J,Muscle,6mo,3,Muscle_6mo
Muscle_6mo_br4,SRR23517657_1.fastq.gz,SRR23517657_2.fastq.gz,SRR23517657,male,C57BL/6J,Muscle,6mo,4,Muscle_6mo
Lung_6mo_br1,SRR23517670_1.fastq.gz,SRR23517670_2.fastq.gz,SRR23517670,male,C57BL/6J,Lung,6mo,1,Lung_6mo
Lung_6mo_br2,SRR23517671_1.fastq.gz,SRR23517671_2.fastq.gz,SRR23517671,male,C57BL/6J,Lung,6mo,2,Lung_6mo
Lung_6mo_br3,SRR23517672_1.fastq.gz,SRR23517672_2.fastq.gz,SRR23517672,male,C57BL/6J,Lung,6mo,3,Lung_6mo
Lung_6mo_br4,SRR23517673_1.fastq.gz,SRR23517673_2.fastq.gz,SRR23517673,male,C57BL/6J,Lung,6mo,4,Lung_6mo
Liver_6mo_br1,SRR23517686_1.fastq.gz,SRR23517686_2.fastq.gz,SRR23517686,male,C57BL/6J,Liver,6mo,1,Liver_6mo
Liver_6mo_br2,SRR23517687_1.fastq.gz,SRR23517687_2.fastq.gz,SRR23517687,male,C57BL/6J,Liver,6mo,2,Liver_6mo
Liver_6mo_br3,SRR23517688_1.fastq.gz,SRR23517688_2.fastq.gz,SRR23517688,male,C57BL/6J,Liver,6mo,3,Liver_6mo
Liver_6mo_br4,SRR23517689_1.fastq.gz,SRR23517689_2.fastq.gz,SRR23517689,male,C57BL/6J,Liver,6mo,4,Liver_6moExamining the raw data (FASTQ)
Fastq files contain the raw sequence and the base quality score. Each sequence is comprised of four lines.
Description for each line
The sequence header (starts with @) and contain identifiers for the read
The sequence
'+'
Base quality scores @VH01192:45:AAC7JMMM5:1:1101:19973:1000 1:N:0:AGTTCAGG+TCTGTTGG
NATGGGACAGACATGCTGGCGGCACTCACTCACTTGGGCGGCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCAGGATCTCGTATTCCCTTTTTTTTTTGTAAATTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
#-C;;CCCCC-C-CCCCCCCCC-CCCCCCCCCCCCCCCCCCC-CCCCCCCC-CC-CCCCCCCCC-CC-CCCCCCCCC;-C--CC-CCCC--CC-C--;---;C----;-;CC-------C-C-C-C--CCCC-;C;CCC-CCCCCCCCCCC
@VH01192:45:AAC7JMMM5:1:1101:20125:1000 1:N:0:AGTTCAGG+TCTGTTGG
NCCCAGCCCCAGCGACTCCTAATAAAGCATTTCAGCAAATAAAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCAAGTCACAGTTCAGGATGTGGTTTTTGGTTTTTTTTTTTTTAAATTTTGGGGGGGGGGGGGGGGGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC--CCCCCCCCCCCCCCCC;CCCC-CCCCCCCCCC---C-C---C-CC-CCCCC;CC-CC-C--;C-CC-------C-C---C-C----CC;C-CCCC;C;CCCCCCCCCCCC
@VH01192:45:AAC7JMMM5:1:1101:21034:1000 1:N:0:AGTTCAGG+TCTGTTGG
NTTGCAATGCTCAATAAGTCTATTCCACCTCAGTGTCCTTTTTAAAGAGTTTTGGAAAAAAAAAAAAAAAAAAAGATCGTAAGAGCACACGTCTGAACTCCAGTCACAGTTCAGGTTGTGGGTTTTCGTGTTTTTGTTTTTATTTTTGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCCCC;C;CCCCCCCCCCCCCCCCCCCCCCCCCC-CC;C-CC-C;CCCCCC;C;CC-CCCCC;;;CCC-;;C;-;-C---C-C;;;--C-CCCC---C;C-----C;--C-
@VH01192:45:AAC7JMMM5:1:1101:21488:1000 1:N:0:AGTTCAGG+TCTGTTGG
NCCTCAAAAAAAAAAAAAAAAAAAAAAAAAATTTGGTATGTGAAATTTTTTTAATACATTTAAATTTTATGTTTTTGTTTTTCTTTTTTTTTTTTTTAAATATTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCC-;---C-C;-;------CCCCCC;--CC--C-C-;-C-C--;C;-----C-C--CCCC-C--;C-C---C-;C;-C-C;CCCCCCCCC;CCCCC-CCCCCCCCCCC;C;CCCCCCCC;CCCExploring the FASTQ file header
Sample header:
@VH01192:45:AAC7JMMM5:1:1101:19973:1000 1:N:0:AGTTCAGG+TCTGTTGG
Header Description
unique instrument id
run id
flowcell id
flowcell lane
tile number within the flowcell lane
x-coordinate of the cluster within the tile
y-coordinate of the cluster within the tile
Member of a read/mate pair, 1 or 2
Y if the read is filtered, N otherwise
0 when none of the control bits are on
Index sequenceSequence Quality Control (QC)
We will run QC on the raw data using fastqc and trim_galore.
fastqc will generate stats of the raw reads including:
- Base quality score distribution
- GC content
- Sequence duplication
trim_galore will remove adapters in the sequence and can remove N bases or bases with low quality score. trim_galore will also run fastqc on the trimmed dataset.
$ ls analysis/fastqc
SRR23517590_1_fastqc.html
SRR23517590_1_fastqc.zip
SRR23517590_2_fastqc.html
SRR23517590_2_fastqc.zip
SRR23517591_1_fastqc.html
SRR23517591_1_fastqc.zip
SRR23517591_2_fastqc.html
SRR23517591_2_fastqc.zip
SRR23517592_1_fastqc.html
SRR23517592_1_fastqc.zip
SRR23517592_2_fastqc.htmlThe ‘val_1’ and ‘val_2’ files are the validated files after trim_galore processing. These files will be the input for the STAR alignment.
$ ls analysis/trim_galore
Brain_6mo_br1_val_1_fastqc.html
Brain_6mo_br1_val_1_fastqc.zip
Brain_6mo_br1_val_1.fq.gz
Brain_6mo_br1_val_2_fastqc.html
Brain_6mo_br1_val_2_fastqc.zip
Brain_6mo_br1_val_2.fq.gz
Brain_6mo_br2_val_1_fastqc.html
Brain_6mo_br2_val_1_fastqc.zip
Brain_6mo_br2_val_1.fq.gz
Brain_6mo_br2_val_2_fastqc.html
Brain_6mo_br2_val_2_fastqc.zip
Brain_6mo_br2_val_2.fq.gzBuilding an index for the reference genome
To run the STAR aligner, we first need to create an index of the genome.
To generate an index, you’ll need the following files:
- FASTA: Fasta file containing sequences of all the chromosomes/scaffolds for the genome
- GTF/GFF: General Transfer Format (GTF) or General Feature Format (GFF)
You only need to run the indexing step once. The indexed files can be used for alignments for all subsequent samples
For assembled genomes containing large number of scaffolds or large genomes, this process can take some time and may require adjusting the indexing parameters.
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNchrM ncbiRefSeq.2020-10-27 transcript 15356 15422 . - . gene_id "TrnP"; transcript_id "rna-TrnP"; gene_name "TrnP";
chrM ncbiRefSeq.2020-10-27 exon 15356 15422 . - . gene_id "TrnP"; transcript_id "rna-TrnP"; exon_number "1"; exon_id "rna-TrnP.1"; gene_name "TrnP";
chrM ncbiRefSeq.2020-10-27 transcript 15289 15355 . + . gene_id "TrnT"; transcript_id "rna-TrnT"; gene_name "TrnT";
chrM ncbiRefSeq.2020-10-27 exon 15289 15355 . + . gene_id "TrnT"; transcript_id "rna-TrnT"; exon_number "1"; exon_id "rna-TrnT.1"; gene_name "TrnT";
chrM ncbiRefSeq.2020-10-27 transcript 14145 15288 . + . gene_id "CYTB"; transcript_id "NP_904340.1"; gene_name "CYTB";
chrM ncbiRefSeq.2020-10-27 exon 14145 15288 . + . gene_id "CYTB"; transcript_id "NP_904340.1"; exon_number "1"; exon_id "NP_904340.1.1"; gene_name "CYTB";
chrM ncbiRefSeq.2020-10-27 CDS 14145 15288 . + 0 gene_id "CYTB"; transcript_id "NP_904340.1"; exon_number "1"; exon_id "NP_904340.1.1"; gene_name "CYTB";
chrM ncbiRefSeq.2020-10-27 start_codon 14145 14147 . + 0 gene_id "CYTB"; transcript_id "NP_904340.1"; exon_number "1"; exon_id "NP_904340.1.1"; gene_name "CYTB";Sequence alignment with STAR
Now we can align the reads from each sample to the indexed reference genome.
Depending on the sequencing depth, this step can take some time.
$ ls analysis/star
Skin_6mo_br1_Aligned.sortedByCoord.out.bam
Skin_6mo_br1_Aligned.sortedByCoord.out.bam.bai
Skin_6mo_br1.bw
Skin_6mo_br1_Log.final.out
Skin_6mo_br1_Log.out
Skin_6mo_br1_Log.progress.out
Skin_6mo_br1_ReadsPerGene.out.tab
Skin_6mo_br1_Signal.UniqueMultiple.str1.out.wig
Skin_6mo_br1_Signal.UniqueMultiple.str2.out.wig
Skin_6mo_br1_Signal.Unique.str1.out.wig
Skin_6mo_br1_Signal.Unique.str2.out.wig
Skin_6mo_br1_SJ.out.tab
Skin_6mo_br2_Aligned.sortedByCoord.out.bam
Skin_6mo_br2_Log.out
Skin_6mo_br2_Log.progress.out
Skin_6mo_br2__STARtmp$ less analysis/star/wt_1_Log.final.out
Started job on | Dec 10 21:54:39
Started mapping on | Dec 10 21:54:53
Finished on | Dec 10 22:49:04
Mapping speed, Million of reads per hour | 17.08
Number of input reads | 15426551
Average input read length | 267
UNIQUE READS:
Uniquely mapped reads number | 12718109
Uniquely mapped reads % | 82.44%
Average mapped length | 260.81
Number of splices: Total | 9531195
Number of splices: Annotated (sjdb) | 9457319
Number of splices: GT/AG | 9437998
Number of splices: GC/AG | 73455
Number of splices: AT/AC | 5941
Number of splices: Non-canonical | 13801
Mismatch rate per base, % | 0.20%
Deletion rate per base | 0.01%
Deletion average length | 2.02
Insertion rate per base | 0.00%
Insertion average length | 1.32
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 0
% of reads mapped to multiple loci | 0.00%
Number of reads mapped to too many loci | 1717783
% of reads mapped to too many loci | 11.14%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 979911
% of reads unmapped: too short | 6.35%
Number of reads unmapped: other | 10748
% of reads unmapped: other | 0.07%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%Anatomy of the alignment file
The alignment information are stored in a sequence alignment map (SAM) file. This file can be large depending on the number of sequenced reads. Therefore, SAM files are generally converted to a BAM (binary compressed version of SAM) file to reduce storage space but requires software (e.g., samtools) in order to see the file content.
Sequence alignment map (SAM) format official documentation is available here. The SAM file consists of header rows and rows for each read. Each row contains 11 mandatory fields.
@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:195154279
@SQ SN:chr10 LN:130530862
@SQ SN:chr11 LN:121973369
@SQ SN:chr12 LN:120092757
@SQ SN:chr13 LN:120883175
@SQ SN:chr14 LN:125139656
@SQ SN:chr15 LN:104073951
@SQ SN:chr16 LN:98008968
@SQ SN:chr17 LN:95294699
@SQ SN:chr18 LN:90720763
@SQ SN:chr19 LN:61420004
@SQ SN:chr1_GL456210v1_random LN:169725
@SQ SN:chr1_GL456211v1_random LN:241735
@SQ SN:chr1_GL456212v1_random LN:153618
@SQ SN:chr1_GL456221v1_random LN:206961
@SQ SN:chr1_MU069434v1_random LN:8412
@SQ SN:chr1_GL456239v1_random LN:40056
@SQ SN:chr2 LN:181755017
@SQ SN:chr3 LN:159745316
@SQ SN:chr4 LN:156860686
@SQ SN:chr4_JH584295v1_random LN:1976
@SQ SN:chr5 LN:151758149
@SQ SN:chr5_JH584296v1_random LN:199368
@SQ SN:chr5_JH584297v1_random LN:205776
@SQ SN:chr5_JH584298v1_random LN:184189
@SQ SN:chr5_GL456354v1_random LN:195993
@SQ SN:chr5_JH584299v1_random LN:953012
@SQ SN:chr6 LN:149588044
@SQ SN:chr7 LN:144995196
@SQ SN:chr7_GL456219v1_random LN:175968
@SQ SN:chr8 LN:130127694
@SQ SN:chr9 LN:124359700
@SQ SN:chrM LN:16299
@SQ SN:chrUn_GL456359v1 LN:22974
@SQ SN:chrUn_GL456360v1 LN:31704
@SQ SN:chrUn_GL456366v1 LN:47073
@SQ SN:chrUn_GL456367v1 LN:42057
@SQ SN:chrUn_GL456368v1 LN:20208
@SQ SN:chrUn_GL456370v1 LN:26764
@SQ SN:chrUn_GL456372v1 LN:28664
@SQ SN:chrUn_GL456378v1 LN:31602
@SQ SN:chrUn_GL456379v1 LN:72385
@SQ SN:chrUn_GL456381v1 LN:25871
@SQ SN:chrUn_GL456382v1 LN:23158
@SQ SN:chrUn_GL456383v1 LN:38659
@SQ SN:chrUn_GL456385v1 LN:35240
@SQ SN:chrUn_GL456387v1 LN:24685
@SQ SN:chrUn_GL456389v1 LN:28772
@SQ SN:chrUn_GL456390v1 LN:24668
@SQ SN:chrUn_GL456392v1 LN:23629
@SQ SN:chrUn_GL456394v1 LN:24323
@SQ SN:chrUn_GL456396v1 LN:21240
@SQ SN:chrUn_JH584304v1 LN:114452
@SQ SN:chrUn_MU069435v1 LN:31129
@SQ SN:chrX LN:169476592
@SQ SN:chrX_GL456233v2_random LN:559103
@SQ SN:chrY LN:91455967
@SQ SN:chrY_JH584300v1_random LN:182347
@SQ SN:chrY_JH584301v1_random LN:259875
@SQ SN:chrY_JH584302v1_random LN:155838
@SQ SN:chrY_JH584303v1_random LN:158099
@PG ID:STAR PN:STAR VN:2.7.10b CL:STAR --runMode alignReads --runThreadN 8 --genomeDir index/mm39-star --readFilesIn analysis/trimga
lore/Brain_6mo_br1_val_1.fq.gz analysis/trimgalore/Brain_6mo_br1_val_2.fq.gz --readFilesCommand zcat --outFileNamePrefix analysis/star/Brain
_6mo_br1_ --outSAMtype BAM SortedByCoordinate --outWigType wiggle --outFilterMultimapNmax 1 --outFilterMismatchNmax 2 --quantMode Gene
Counts
@PG ID:samtools PN:samtools PP:STAR VN:1.22.1 CL:samtools view -h analysis/star/Brain_6mo_br1_Aligned.sortedByCoord.out.bam
@CO user command line: STAR --runThreadN 8 --runMode alignReads --genomeDir index/mm39-star --readFilesCommand zcat --outSAMtype BAM SortedByCoordin
ate --outFileNamePrefix analysis/star/Brain_6mo_br1_ --outFilterMismatchNmax 2 --outFilterMultimapNmax 1 --quantMode GeneCounts --outWigType wiggle --re
adFilesIn analysis/trimgalore/Brain_6mo_br1_val_1.fq.gz analysis/trimgalore/Brain_6mo_br1_val_2.fq.gz
SRR23517638.4811943 99 chr1 3134107 255 1S150M = 3134145 188 AGCCAGGTCTCTTTTTTGTAATCAATTTTGTTCGCTGTTCTCTAGAAAAAAGACATATTTGCCC
TCAATCCATATTTTGCTATAGTATATTGAACTGTAGTGTTGCAGACTCCTAACAGAGGAATTATTAGAAGTATTCATGTGGGCTGAA ,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFF NH:i:1 HI:i:1 AS:i:298 nM:i:0
SRR23517638.4811943 147 chr1 3134145 255 150M1S = 3134107 -188 CTCTAGAAAAAAGACATATTTGCCCTCAATCCATATTTTGCTATAGTATATTGAACTGTAGTGT
TGCAGACTCCTAACAGAGGAATTATTAGAAGTATTCATGTGGGCTGAAGAAAACAATAGGTTCTAGTGAGTGGTTTGCAATTCCTTA FFF,FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF: NH:i:1 HI:i:1 AS:i:298 nM:i:0
SRR23517638.4277648 99 chr1 3147363 255 107M20S = 3147363 107 TCTTATATTGTCTCCAGCAAAAGGTGTAGCCCAGATTAAAGGTGTGTTCCACCACACCTTTAAT
CCCAGATGAAAGGCATAGCCCAGATTAAAGGTGTGTTCCTTAACTCGGAGATTCAATCTTCTA ,:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:208 nM:i:2
SRR23517638.4277648 147 chr1 3147363 255 107M20S = 3147363 -107 TCTTATATTGTCTCCAGCAAAAGGTGTAGCCCAGATTAAAGGTGTGTTCCACCACACCTTTAAT
CCCAGATGAAAGGCATAGCCCAGATTAAAGGTGTGTTCCTTAACTCGGAGATTCAATCTTCTA F,FFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFF:FFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, NH:i:1 HI:i:1 AS:i:208 nM:i:2SAM Header
A00738:657:H72KHDSX7:1:1327:12717:27023
99
chr01
3008
255
149M
=
3086
227
CGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAA
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF
NH:i:1 HI:i:1 AS:i:296 nM:i:0
For decoding the SAM Flags, check out this website from the Broad Institute.
The CIGAR string is a representation of how the read aligned to the reference, including matches, mismatches, deletions, insertions, and splicing. More info on the CIGAR string is available here
(Optional) - Convert BAM to BigWig
The BAM file produced from STAR contains alignment information that can be viewed in IGV. However, this file can be big and sluggish. One option is to convert the BAM file to a smaller BigWig format that will allow fast visualization of the results.
The BigWig format will aggregate and bin your data across the genome. Therefore, you will lose individual read information. For observing SNPs, you’ll want to use the BAM files instead.
Quantifying read counts using featureCounts
After aligning the reads to the genome, we must quantify the total reads associated with each genome feature (e.g., gene).
# Program:featureCounts v2.0.3; Command:"featureCounts" "-T" "8" "-s" "0" "-a" "genome/Araport11_GFF3_genes_transposons.201606.gtf" "-t" "exon" "-g" "gene_id" "--primary" "-o" "analysis/featurecounts/mir163_1.fcnts.txt" "-p" "--countReadPairs" "analysis/star/mir163_1_Aligned.sortedByCoord.out.bam"
G# Program:featureCounts v2.0.3; Command:"featureCounts" "-T" "8" "-s" "2" "-a" "genome/mm39.ncbiRefSeq.gtf" "-t" "exon" "-g" "gene_id" "--primary" "-o" "analysis/featurecounts/Brain_6mo_br1.fcnts.txt" "-p" "--countReadPairs" "analysis/star/Brain_6mo_br1_Aligned.sortedByCoord.out.bam"
Geneid Chr Start End Strand Length analysis/star/Brain_6mo_br1_Aligned.sortedByCoord.out.bam
TrnP chrM 15356 15422 - 67 36
TrnT chrM 15289 15355 + 67 33
CYTB chrM 14145 15288 + 1144 68546
TrnE chrM 14071 14139 - 69 2
ND6 chrM 13552 14070 - 519 223
ND5 chrM 11742 13565 + 1824 38843
TrnL2 chrM 11671 11741 + 71 0
TrnS2 chrM 11613 11671 + 59 0
TrnH chrM 11546 11612 + 67 0
ND4 chrM 10167 11544 + 1378 37498
ND4L chrM 9877 10173 + 297 172
TrnR chrM 9808 9875 + 68 0
ND3 chrM 9459 9806 + 348 205
TrnG chrM 9391 9458 + 68 0
COX3 chrM 8607 9390 + 784 1192
ATP6 chrM 7927 8607 + 681 1250
ATP8 chrM 7766 7969 + 204 145
TrnK chrM 7700 7764 + 65 0
COX2 chrM 7013 7696 + 684 1183
TrnD chrM 6942 7011 + 70 0
TrnS1 chrM 6870 6938 - 69 0
COX1 chrM 5328 6872 + 1545 157088
TrnY chrM 5260 5326 - 67 1
TrnC chrM 5192 5257 - 66 0
TrnN chrM 5089 5159 - 71 0Differential expression analysis with DESeq2
DESeq2 is an R package that allows differential expression analysis.
To run DESeq2, you will need two key inputs:
- Gene count matrix (i.e., read counts for each gene) (featureCounts output)
- Sample information file (metadata.csv)
You will also need to define the type of contrasts/design for the analysis.
For example, suppose we have the following metadata for a dataset:
| sample | replicate | genotype | treatment | grp |
|---|---|---|---|---|
| wt_1 | 1 | wild-type | ctrl | wt_ctrl |
| wt_2 | 2 | wild-type | ctrl | wt_ctrl |
| wt_3 | 1 | wild-type | inhibitor | wt_inhib |
| wt_4 | 2 | wild-type | inhibitor | wt_inhib |
| mut_1 | 1 | mutant | ctrl | mut_ctrl |
| mut_2 | 2 | mutant | ctrl | mut_ctrl |
| mut_3 | 1 | mutant | inhibitor | mut_inhib |
| mut_4 | 2 | mutant | inhibitor | mut_inhib |
We can run the following contrasts:
Contrasts
- contrast = ~ genotype
- contrast = ~ treatment
- contrast = ~ genotype + treatment + genotype:treatment
- contrast = ~ grp
Comparisons
- Compares wild-type vs mutant (irrrespective of treatment)
- Compares ctrl vs inhibitor (irrespective of genotype)
- Compares genotype + treatment and their interactions
- Compares wt_ctrl vs wt_inhib, wt_ctrl vs mut_ctrl, wt_ctrl vs mut_inhib
DESeq2 (pre-processing output)
The custom DESeq2 script generates several plots, tables, and genelists organized into the following directories.
Outputs are organized into the genelists or plots folders.
The genelists folder contains a summary of DEGs, all the deseq2 outputs (stored as an R object (RDS) and a CSV file, and DEGs with annotations.
$ ls -1 analysis/deseq2/genotype/genelists/
deg_summary.csv
deseq2_results.RDS
miR163_mut-wt.deseq2_output.csv
miR163_mut-wt.DownDEG.annotated.csv
miR163_mut-wt.UpDEG.annotated.csvThe plots folder contains a bar plot summarizing the number of DEGs, a PCA plot, and an enhanced volcano plot highlighting the DEGs.
$ ls -1 analysis/deseq2/genotype/plots/*png
analysis/deseq2/genotype/plots/DEG_summary.png
analysis/deseq2/genotype/plots/Enhancedvolcano_miR163_mut-vs-wt.png
analysis/deseq2/genotype/plots/PCA.png| Comparisons | Counts_Up_or_Down | Counts_Up | Counts_Down |
|---|---|---|---|
| miR163_mut-wt | 3 | 2 | 1 |
Summary using MultiQC
MultiQC is used to generate a summary of the raw and processed data. The program aggregates all the log files from different programs (e.g., trim_galore, STAR) and provide a summary HTML file.
Data Visualization with IGV
To inspect and examine your RNA-seq results, you can load the BigWig or BAM files into IGV.
Download the .bam, .bam.bai, and .bw files. The .bam files are quite large (> 1Gb) whereas the .bw files are smaller (Mb). These files can be loaded into IGV for visualization and inspection.
Show IGV demonstration
Hands-On Session
We will have a hands-on practice with the analysis workflow using a Jupyter notebook.
Feedback
Please take a few minutes to provide feedback for this workshop by filling out the workshop survey.
All responses are anonymous and will help to improve future workshops and training.
Additional Slides
Downloading/Uploading files from the server
There will be times where you would want to see plots or access files generated on the cluster. For easier access, you can use an SFTP program with a GUI that connects to the server and provides direct access to those files.
There are several free GUI applications including Filezilla, Cyberduck, WinSCP.
These applications are available for MacOS, WinOS, and Linux platforms.
Once you have the application installed, start up the program and follow the instructions to connect to a site.
Click here for instructions to connect and authenticate to the cluster using Filezilla with password + DUO authentication.
Submitting jobs on the cluster
For this workshop, we will submit jobs to the cluster (to run in the background).
Example job submission:
The submitted job will have a six-digit JOBID. To check on the status of your submitted job(s),
# Show all jobs from USER
squeue -u <USERNAME>
squeue --me
# Show specific job with JOBID
scontrol show job <JOBID>To cancel a submitted job, use the JOBID.
# To delete a single job
scancel -i <JOBID>
scancel <JOBID>
# To delete all jobs for a user
scancel -u <USERNAME>All submitted jobs will generate a log file (stored in the log folder) containing the JOBNAME_JOBID.log
- JOBNAME (e.g., fastqc, trim_galore)
You can take a look at the log file to see standard output/errors from running the script.
Create conda environment
We will create a conda environment containing bioinformatics software packages not readily available on the cluster for data processing and analysis. This conda environment includes special R packages (e.g., EnhancedVolcano) and software (e.g., multiqc).
To create the conda environment:
Note: This step can take anywhere from 25-45 minutes. However, it’s running in the background so we can come back to it later.
To make sure the conda environment was created successfully, we can activate the conda environment:
You should now see the environment (DEG-analysis) next to your $ prompt.
Setting up Bash for Jupyter notebook
We will install Bash for Jupyter notebook to enable running shell scripts within the notebook.
First, open a terminal session within the Jupyter Lab.
In the terminal, we’ll first create a conda environment and then install the bash-kernel.
- create a conda environment name jupyter-nb
- activate the conda environment
- install the bash kernel
- Add bash to spec list
If the Bash kernel doesn’t appear as an option for your notebook, you can shutdown all kernels and try again